(54) Dynamic sprites for encoding video data

(57) Video data representing a scene are processed to improve encoding efficiencies. The video data
are segmented into rigidly and non-rigidly moving video objects. The segmentation is performed by
estimating local motion vectors for the video data of a sequence of frames. The local motion vectors
are clustered to determine dominant motions, and video data having motion vectors similar to the
dominant motions are segmented out as rigidly moving video objects. For these objects, motion
parameters are robustly estimated. Using the motion parameters, the rigid video objects of the frames
are integrated in one or more corresponding sprites stored in a long-term memory. The sprites can be
used in a two-way motion compensated prediction technique for encoding video data where blocks of
video data are encoded either from sprites based on rigid motion parameters, or from a previous frame
based on local motion parameters. In addition, the sprites can be used to construct a high resolution
panoramic view of the background of a scene.
Description

FIELD OF THE INVENTION 

This invention is related to data processing, and more particularly to processing video data.

BACKGROUND OF THE INVENTION 

An observable scene in the real physical world can be represented as temporally and spatially related video data
in a memory of a computer system. The video data represent the optical flow of the scene over time and space as
varying light intensity values.

Efficient integration and representation of the video data is a key component in video encoding schemes, for
example MPEG encoding. Video encoding can reduce the amount of memory used to store the data, as well as reduce the
time required to process and communicate the data.

Scenes are generally composed of foreground and background components. Typically, the optical flow of the
smaller foreground is in motion with respect to the larger and relatively static background. For example, in a stadium
scene, a player or players will move back and forth against the background of the grandstands while the camera pans
and zooms following the player. Various parts of the background are revealed as the player moves in the foreground.
One of the more pertinent problems of video encoding deals with encoding uncovered background of scenes.
Classical motion compensated prediction schemes are unable to predict newly revealed background areas, and therefore,
encoding is inefficient.

To overcome this problem, background memory techniques are known. More precisely, these techniques identify still
regions of the scene as background which can be stored in a long-term memory. Whenever a background area is
uncovered, and provided that the background area has previously been observed, data unavailable with classical
prediction techniques can be retrieved from the background memory.

These techniques are effective for video-conferencing or video-phone sequences which are typically characterized 
by a still background. However, the model of a static background does not hold true for more complex scenes which 
include camera motion, e.g., panning or zooming, and multiple moving objects. 

In order to integrate temporal information, a mosaic representation, also referred to as a salient still, has been
shown to be efficient. Basically, these techniques estimate the camera motion using global motion estimation, and align
successive images, e.g., frames, in a video sequence by canceling contributions due to camera motion. A mosaic is
built by temporally integrating the aligned frames or images in a memory. In this way, the mosaic captures the
information in multiple aligned frames of a video sequence.

However, these techniques are typically applied without distinction between the background and foreground of the
scene. That is, the camera motion is only representative of the background motion. Therefore, the foreground appears
blurred. Furthermore, as the foreground is integrated into the mosaic representation, the problem of uncovered
background remains unsolved. Also, as far as video encoding is concerned, no residual signals are transmitted, leading to
noticeable coding artifacts.

Sprites are well-known in the field of computer graphics. Sprites correspond to synthetic video objects which can
be animated, and overlaid onto a synthetic or natural scene. For instance, sprites are widely used in video games. More
recently, sprites have been proposed for video encoding.

The sprite can either be a synthetic or natural object determined from a sequence of images. In the latter case, the
technique is applied to a video object whose motion can be modeled by a rigid body motion. Since the technique is
applied to a coherently moving body, instead of the entire frame, some of the problems of frame aligned mosaicking are
alleviated.

Generally, these techniques require that the sprite is identified and available before the encoding of a sequence of
images begins. The sprite can be encoded using intraframe coding techniques. The encoded sprite can then be
transmitted along with rigid body motion information to a decoder. The decoder can warp the sprite using the motion
information to render the scene.

In the case of natural scenes, an analysis stage is required prior to encoding in order to build the static sprite.
During the analysis, segmentation, global estimation and warping are performed on the video data of the frames. As a
result, this method introduces a very significant delay, since a large number of frames needs to be analyzed to build the
static sprite. For many real-time encoding applications this delay is unacceptable.

Taking the above into consideration, it is desired to provide a method and apparatus which can dynamically build a
sprite in a memory. Furthermore, the method should be suitable for real-time video data encoding.
SUMMARY OF THE INVENTION

Disclosed is a software implemented method for dynamically building one or more sprites of a scene. A sprite is
defined as a region of the scene, i.e., a video object, where the motion is coherent or rigid. For example, a static
background portion of the scene can be moving rigidly due to a camera panning the scene.

Because the method compensates for the rigid motion of the video objects, it overcomes the
shortcomings of classical background memory techniques. Furthermore, since the temporal integration is only performed on
rigidly moving objects, the method outperforms classical mosaic representations.

The method includes three main stages. First, a discrimination is made between regions of the scene which have
rigid and non-rigid motion. For example, the background can be a rigidly moving object due to camera motion. Similarly,
a train moving across the background can be another rigidly moving object. The discrimination is based on local motion
information. Second, the parameters which describe the rigid motion are estimated for each identified video object.
Third, using the rigid motion information, a dynamic sprite is built for each identified video object by progressively
integrating data representing the video object in a long-term memory. Each of these stages represents a challenging
problem; an efficient solution to each is described in detail below.

The disclosed mosaicking technique permits an efficient integration and representation of motion
information. More specifically, a two-way motion compensation prediction is disclosed. A current frame can be predicted either
from memory using rigid motion estimates, or from a previous frame using local motion estimates.

Therefore, this method handles, to a large extent, the problem of dealing with uncovered regions of a scene.
Furthermore, the amount of additional information required to represent motion information is greatly reduced due to
representing motion in a compact parametric form.

Moreover, these gains are obtained without introducing any encoding delays. The method can be applied
independently of any subsequent encoding techniques. In particular, though the method is applicable to classical motion
compensated block-DCT encoding schemes, it is also appropriate for object-based representation of a scene. Finally,
the method can be used to provide a high resolution panoramic view of background components of scenes.

The invention, in its broad form, resides in a method and apparatus for processing video data representing a scene,
as recited in claims 1 and 11 respectively.

BRIEF DESCRIPTION OF THE DRAWINGS

A more detailed understanding of the invention can be had from the following description of a preferred
embodiment, given by way of example, and to be understood in conjunction with the accompanying drawings, in which:

♦ Figure 1 is an arrangement which uses video processing according to a preferred embodiment of the invention;
♦ Figure 2 is a block diagram of a process for dynamically building a sprite;
♦ Figure 3 is a flow diagram of a process for segmenting video data into video objects having rigid and non-rigid motion;
♦ Figures 4A and 4B show forward and backward transforms to warp pixel intensity values;
♦ Figure 5 shows a dynamic integration of a sequence of frames into a sprite;
♦ Figure 6 is a block diagram of the relative sizes of a frame and a dynamic sprite;
♦ Figure 7 is a flow diagram of a process for encoding and decoding video data using a dynamic sprite; and
♦ Figure 8 is a two-way motion prediction technique for video data encoding.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

System Arrangement

Figure 1 shows an arrangement 100 which uses a method and apparatus for dynamically generating one or more
"sprites." Each sprite can be built in real-time as video data are encoded and decoded. The method and apparatus are
suitable for low bit rate video data transmission.

The arrangement 100 includes a camera 110 for acquiring light intensity values of a real scene 10 as video data. If
the camera 110 is analog, an A/D converter 120 can be used to generate representative digital signals.

In any case, a temporal sequence of images or frames 130 is produced. Each frame includes a plurality of picture
elements (pixels) arranged in a regularized pattern. Each pixel represents a light intensity value of part of the scene 10.
The frames 130 are presented to a video processing system 140.

The system 140 can be in the form of a modern workstation including one or more processors 141, a first memory
(DRAM) 142, and a second memory (DISK) 143, connected to each other by a bus 144. A portion of the memories can
be allocated as memory for building the one or more sprites.

The processing system 140 can be connected, through an input/output interface (I/O) 145 and a network 160, to a
decoding system 170. The decoding system 170 can be connected to a display device 180 for generating reconstructed
images of the scene 10. The decoding system can also be in the form of a workstation, or the decoding system can be
a set-top box, and the display device can be a television.

In a general sense, the arrangement 100 does nothing more than receive input photons as light intensity values,
electronically represent the values as pixels of video data, and process the video data to produce output photons
representative of the input.

During operation, the system 140 processes the images to dynamically generate one or more sprites in the
memories. Furthermore, the system 140 can encode intraframes and motion information in real-time. The encoded data can
be transmitted to the decoder 170 via the network 160 for reconstruction on the display device 180.


Overview of Process 

More specifically, as shown in Figure 2, instructions of programs comprise a computer implemented method 200 for
dynamically building one or more sprites representative of portions of the scene 10 without introducing any delays. In
step 205, a set of local motion vectors is estimated for each frame of the sequence 130 of Figure 1.

In step 210, the video data of each frame are segmented into video objects having rigid and non-rigid motion. For 
example, the static background of a scene may be moving rigidly if the scene is viewed by a panning camera. A train 
moving across the background may be another rigidly moving object. Similarly, a car speeding to catch up with the train 
may be another rigidly moving object. 

The video data are segmented using the local motion estimation of step 205. In step 220, motion information, e.g.,
motion parameters, is estimated for the video objects segmented as having rigid motion. The motion parameters are
applied to the video data to compensate for, for example, camera motion in step 230. The compensated data can
temporally be integrated as one or more dynamic sprites stored in the memories during step 240.

Segmentation

Figure 3 shows a process 300 for segmenting video data representing the scene 10 of Figure 1 into rigidly and
non-rigidly moving video objects. In step 310, a set of local motion vectors 301 is estimated for each frame. There is one
motion vector for each pixel. Motion vectors can be determined by comparing pixels of a current frame with
corresponding pixels of a previous frame.

The local motion vectors 301 are clustered into subsets of vectors in step 320 to determine one or more dominant
motions. In step 330, regions having motion similar to the dominant motions are identified. If a motion vector is similar
to one of the dominant motions, the associated pixels are classified in step 340 as belonging to the corresponding
rigidly moving video object 302.
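
Steps 320 to 340 can be illustrated with a minimal sketch. The text does not name a clustering algorithm, so the version below assumes the simplest possible one: motion vectors are rounded to integers and the most frequent value is taken as the dominant motion. The function names, the tolerance of one pixel, and the sample motion field are all hypothetical.

```python
from collections import Counter

def dominant_motions(vectors, num_motions=1):
    """Return the most frequent motion vectors after rounding to integers.

    This histogram vote is an assumed stand-in for the clustering of step 320.
    """
    counts = Counter((round(vx), round(vy)) for vx, vy in vectors)
    return [motion for motion, _ in counts.most_common(num_motions)]

def classify_rigid(vectors, motions, tolerance=1.0):
    """For each vector, return the index of the dominant motion it resembles, or None."""
    labels = []
    for vx, vy in vectors:
        label = None
        for index, (mx, my) in enumerate(motions):
            if abs(vx - mx) <= tolerance and abs(vy - my) <= tolerance:
                label = index
                break
        labels.append(label)
    return labels

# Hypothetical per-pixel motion field: a panning background plus two foreground pixels.
field = [(2.0, 0.0)] * 6 + [(2.1, -0.2), (-3.0, 1.0), (5.0, 4.0)]
motions = dominant_motions(field)
labels = classify_rigid(field, motions)
```

Pixels labelled `None` are the candidates for non-rigid motion; in low contrast regions this similarity test alone is not reliable, which is what the residual comparison of steps 350 and 360 addresses.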

However, in low contrast regions, the motion vector estimation is unconstrained, and the local motion vectors are
generally unreliable. Therefore, the discrimination based on motion similarity of step 330 may fail. To overcome this
difficulty, the residual information present in the prediction error during the estimation can be exploited in step 350.

Specifically, if the residual information obtained for the dominant motion is similar to the residual information
obtained for the local motion vectors, then the region is classified as moving rigidly, and otherwise as moving non-rigidly
303, step 360. In this way, low contrast areas, where motion estimation is unreliable, can robustly be segmented as
either rigidly or non-rigidly moving video objects.

Motion Estimation 

After segmenting the scene into rigidly and non-rigidly moving video objects, the motion information for the rigidly
moving video objects is robustly estimated. Since the non-rigidly moving objects have been segregated out, the
estimation of the rigid motion is not spoiled by the presence of outliers due to foreground objects whose motion is not
represented by the dominant motions, e.g., the camera motion.

The motion information can be represented by a parametric motion model. For this purpose, two models can
generally be used: a six parameter affine model, or a nine parameter (eight of which are independent) perspective model.

The six parameter affine model allows for the representation of the motion of a planar surface under orthographic 
projection. However, the model assumes that the depth of the scene 10 of Figure 1 is small when compared with the 
distance from the scene 10 to the camera. 

Therefore, in order to relax this constraint, the nine parameter perspective model is preferred. This model allows for
the representation of the motion of a planar surface under perspective projection.

In order to robustly estimate the parameters of the motion information, a matching technique can be used. This
technique generally outperforms differential and linear regression techniques. The motion parameters (a1, ..., a9) are
obtained by minimizing a disparity measurement between a region of a current frame and a corresponding region of a
previous frame. A search for the minimum is carried out in an n-dimensional parametric space.

To decrease the computational complexity, a fast non-exhaustive search can be carried out where the rigid motion
parameters are estimated progressively. First, the translational component is computed, that is, a1 and a4. Then, the
affine parameters a2, a3, a5 and a6 are computed, and finally the perspective parameters a7, a8, and a9.

Besides decreasing complexity, this search strategy tends to produce more robust estimates, as the translational
component usually carries more information than the other parameters, which often reduce to little more than noise.
However, this search strategy obviously induces the risk of being trapped in a local minimum.

Therefore, the matching motion estimation algorithm can be applied on a multiresolution structure based on a
Gaussian pyramid. With the Gaussian pyramid, the final motion parameters at one level of the pyramid can be
propagated to be an initial estimate for a next level of the pyramid. This multiresolution scheme reduces the
computational load, and also reduces the risk of local minima due to the non-exhaustive search.
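
The propagation of parameters between pyramid levels can be sketched as follows. The text does not say how the parameters are rescaled, so this example assumes a dyadic pyramid (each level has twice the resolution of the previous one) and the parameter layout of the forward transform given in the Appendix, x = (a1 + a2 x' + a3 y') / (a7 x' + a8 y' + a9). Under those assumptions the translational terms double, the affine terms are unchanged, and the perspective terms a7 and a8 halve. The function names and sample values are hypothetical.

```python
def forward_transform(point, params):
    """Map (x', y') in the previous frame to (x, y) in the current frame."""
    a1, a2, a3, a4, a5, a6, a7, a8, a9 = params
    xp, yp = point
    denom = a7 * xp + a8 * yp + a9
    return ((a1 + a2 * xp + a3 * yp) / denom, (a4 + a5 * xp + a6 * yp) / denom)

def propagate_to_finer_level(params, scale=2.0):
    """Rescale parameters from a coarse level for a level `scale` times finer.

    Assumption: pixel coordinates at the finer level are `scale` times the
    coordinates at the coarse level.
    """
    a1, a2, a3, a4, a5, a6, a7, a8, a9 = params
    return (a1 * scale, a2, a3, a4 * scale, a5, a6, a7 / scale, a8 / scale, a9)

# Hypothetical coarse-level estimate.
coarse = (1.5, 1.0, 0.02, -0.5, -0.01, 1.0, 0.001, 0.002, 1.0)
fine = propagate_to_finer_level(coarse)

# The same physical point, expressed at both levels, must map consistently.
xc, yc = forward_transform((10.0, 20.0), coarse)
xf, yf = forward_transform((20.0, 40.0), fine)
```

The propagated tuple only serves as the initial estimate at the finer level; the matching search then refines it there.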



Dynamic Sprites 

Once the scene has been segmented, and the motion estimated, the video data representing rigidly moving video
objects are temporally integrated by mosaicking. There is one sprite for each identified video object. A distinction is
made between static and dynamic sprites. A static sprite integrates all motion information of all of the frames in a
sequence, e.g., mean, median, weighted mean, and weighted median.

In contrast, a dynamic sprite, as described herein, corresponds to a progressive update of the sprite content by
gradually integrating the motion information of selected portions of individual frames, e.g., rigidly moving video objects.
Since the static sprite puts more constraints on the video and coding scheme, such as having high encoding delay, and
the buffering of many frames, the dynamic sprite is preferred.

Temporal Warping for Pixel Alignment 


To build the sprites, the frames are aligned with respect to a coordinate system. The coordinate system can
correspond either to a fixed reference, such as the first frame or a preferred view, or to a time-varying reference, for example,
the current frame. The selected alignment can depend on the intended use.

For a predictive encoding application, where it is desired to encode a next current frame depending on previous
frames, the time-varying reference is preferred. In the case where the goal is to build a panoramic view, the fixed
reference can be used.

More formally, the motion parameters between time t-1 and time t are denoted by A(t-1, t) = (a1, ..., a9). These
parameters define the relationship between pixel coordinates r = (x, y) at time t, and coordinates r' = (x', y') at time
t-1.

As shown in Figures 4A and 4B, a forward transform T_f can warp the pixels of a previous frame 401 at time t-1 to
corresponding pixels of a current frame 402 at time t. Similarly, a backward transform T_b can warp the pixels from the
current frame 402 to the previous frame 401. Since it is unlikely that the coordinates of the warped pixels will coincide
with a selected sampling grid, an interpolation filter can be used to resample the intensity values of the pixels at their
new coordinates.


Mosaicking 

As shown in Figure 5, knowing the rigid motion parameters, frames 131-136 of the sequence 130 can be temporally
aligned with respect to a coordinate system before being integrated into a dynamically evolving sprite, e.g., 501-505.
The coordinate system, as stated above, can correspond to either a fixed reference, e.g., the first frame or a preferred
view, or a time-varying reference, e.g., the current frame. The selected alignment can depend on the
intended use of the sprite. The latter is preferred for predictive coding, and the former may be preferred where the goal
is to build a panoramic view of the scene 10.

Figure 6 shows a sprite 600 which includes pixel intensity values of frames 602 in order to generate a panoramic
view of a scene. The size of the sprite 600 is significantly larger than the size of the frames 602 which are integrated in
order to form the sprite. In the example shown, the rigid motion is downward, and to the right.

The spatial location r of a pixel 601 of one of the frames 602 corresponds to a spatial location R in the sprite 600,
where R is a function of r. That is, there is a function f( ) such that R = f(r), and r = f^-1(R). For instance, f( ) can be
chosen such that the frame being integrated is centered in the sprite 600.
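
One possible choice of f( ) is a constant offset that centers the frame in the sprite. The sketch below assumes integer pixel coordinates with the origin at the top-left corner; the frame and sprite sizes are hypothetical and chosen only for illustration.

```python
def make_centering_map(frame_size, sprite_size):
    """Return f and its inverse for a frame centered inside a larger sprite."""
    frame_w, frame_h = frame_size
    sprite_w, sprite_h = sprite_size
    off_x = (sprite_w - frame_w) // 2
    off_y = (sprite_h - frame_h) // 2

    def f(r):
        # Frame coordinates r = (x, y) to sprite coordinates R.
        return (r[0] + off_x, r[1] + off_y)

    def f_inverse(R):
        # Sprite coordinates R back to frame coordinates r.
        return (R[0] - off_x, R[1] - off_y)

    return f, f_inverse

# Hypothetical sizes: a 352x288 frame inside a 1024x768 sprite.
f, f_inverse = make_centering_map(frame_size=(352, 288), sprite_size=(1024, 768))
R = f((0, 0))
```

With these sizes the top-left frame pixel lands at sprite location (336, 240), leaving equal margins on all sides for content revealed by later camera motion.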


Fixed Reference 

First, the case of a fixed reference is considered. For example, the fixed coordinate system of the dynamic sprite
corresponds to the first frame at a time t0 = 0. Alternatively, the fixed reference can correspond to the coordinate system
of a preferred view.

Using the backward transform of Figure 4B, the pixels of a current frame at time t are warped to this fixed reference
so that the pixels can be integrated with previously stored pixels of the sprite at time t-1 to generate a new sprite at time
t. If no pixels are stored in the memory for the current frame, the pixels of the current frame can be used as initial values
for this portion of the sprite.

As stated above, the coordinates of the warped pixels are unlikely to coincide with the chosen sampling grid for the
sprite. Therefore, new pixel intensity values are interpolated and normalized as necessary. For a fixed reference,
bilinear interpolation is preferred.


Time-Varying Reference 

In the case of the time-varying reference, the sprite at time t-1 is warped to the coordinate system of the current
frame using the forward transform T_f of Figure 4A. Then, the current frame can be integrated into the warped sprite to
dynamically generate the sprite at time t. As for the fixed reference, if no pixel values exist in the sprite at the
coordinates corresponding to the new frame, the current frame is used to initialize the sprite for these coordinates.

In this case also, the coordinates of the warped sprite are unlikely to coincide with the current frame; therefore,
interpolation and normalization may be necessary. For the time-varying reference, a 0th-order interpolation is preferred.
Due to the forward warping using the 0th-order interpolation, black lines having a width of one pixel may be introduced
in the sprite. These lines can be detected by their zero pixel intensity values, and can be reinterpolated from
neighboring pixels.
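
The repair of the one-pixel black lines can be sketched as below. The text only says that zero-valued pixels are reinterpolated from neighboring pixels; averaging the non-zero 4-neighbours is an assumption made here for illustration, and the function name and sample data are hypothetical.

```python
def fill_holes(sprite):
    """Replace zero-valued pixels with the mean of their non-zero 4-neighbours.

    Assumption: a value of exactly 0 marks a hole left by forward warping.
    """
    height, width = len(sprite), len(sprite[0])
    filled = [row[:] for row in sprite]
    for y in range(height):
        for x in range(width):
            if sprite[y][x] != 0:
                continue
            neighbours = [
                sprite[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < height and 0 <= nx < width and sprite[ny][nx] != 0
            ]
            if neighbours:
                filled[y][x] = sum(neighbours) / len(neighbours)
    return filled

# A warped sprite with a vertical black line one pixel wide.
warped = [
    [10, 0, 30],
    [10, 0, 30],
    [10, 0, 30],
]
repaired = fill_holes(warped)
```

Note that this simple test cannot tell a hole from a genuinely black pixel; a practical implementation would more likely track which sprite locations received a contribution during warping.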

Applications to Video Encoding 

Figure 7 shows a general arrangement 700 of an encoder, decoders, and memories to enable video encoding. Frames
710 are encoded by an encoder 720 using, for example, MPEG encoding techniques. The encoded frames include
motion information. The frames are decoded by a local decoder 730.

A process 740, as described above, dynamically builds one or more local sprites in a local memory 750. The
encoded frames and motion information are also transmitted via a network to a remote decoder. The decoded frames,
motion information, and shape of the video object are used to build one or more remote sprites in a remote memory 770
using process 780. The steps of processes 740 and 780 are isomorphic.

Since the local and remote sprites are both built from decoded frames, the local and remote frames have
identical lossy characteristics. Therefore, as an advantage, only the rigid motion information, the shape, and the residual
error information need to be transmitted in order to construct identical sprites in the local and remote memories.


Two-way Predictive Encoding 

As shown in Figure 8, the sprites stored in the memories enable a two-way motion compensated prediction
encoding for video data representing real scenes.

First, in a motion compensated block-DCT (discrete cosine transform) scheme, each block of video data of a frame
can be predictively selected in real-time. In a blocking scheme, each frame is partitioned into a plurality of blocks of
regularized arrays of pixels, for example, each block is an eight by eight array of pixels.

Each block is selected for encoding either from one of the sprites built in the memory based on rigid motion
parameters and sprite prediction 840, or from a previous frame based on local motion parameters and intraframe prediction
850. The multi-way selection is determined by which prediction methodology yields a minimum prediction error, e.g., by
comparing the sum of absolute differences of the prediction error for the sprite prediction and the intraframe prediction
methodologies. The selection can be indicated to the decoder by an identification associated with each block of pixels.
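
The per-block selection can be sketched as follows, using the sum of absolute differences (SAD) as the prediction error. The mode labels `"sprite"` and `"previous_frame"` stand in for the identification sent to the decoder and are hypothetical, as are the function names and the 2x2 sample blocks (the text uses eight by eight blocks). Ties are resolved in favour of the sprite here, which is an arbitrary choice.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )

def select_prediction(current, sprite_pred, frame_pred):
    """Return (mode, error) for the candidate prediction with the smaller SAD."""
    sprite_error = sad(current, sprite_pred)
    frame_error = sad(current, frame_pred)
    if sprite_error <= frame_error:
        return "sprite", sprite_error
    return "previous_frame", frame_error

# Hypothetical block of the current frame and its two candidate predictions.
current = [[10, 12], [14, 16]]
sprite_pred = [[10, 12], [14, 15]]   # warped from the sprite (rigid motion)
frame_pred = [[9, 12], [14, 20]]     # taken from the previous frame (local motion)
mode, error = select_prediction(current, sprite_pred, frame_pred)
```

In this example the sprite prediction wins with an error of 1 against 5, so the block would be flagged as sprite-predicted and only its residual encoded.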

Thus, pixel data representing uncovered rigidly moving regions of a scene can be retrieved from the long-term
memory provided that the data have previously been observed. Furthermore, the parametric rigid motion estimation
efficiently handles camera motion such as panning and zooming, resulting in reduced motion side information. Finally,
the process of integrating many frames into the dynamic sprites also filters the noise in the sequence of frames, leading
to a better prediction.

In addition, the sprites stored in the memory can also be used to construct a high resolution panoramic view of the
background portion of the scene. With dynamically generated sprites, foreground components of the scene may be
entirely eliminated, or only partially observable.

An attached Appendix describes additional details and algorithms which can be used for motion estimation,
warping, mosaicking, motion summation, interpolation, and sprite prediction.

Having described a preferred embodiment of the invention, it will now become apparent to one of skill in the art that
other embodiments incorporating its concepts may be used. It is, therefore, the applicant's intention that this invention
should not be limited to the disclosed embodiment only, and other modifications and embodiments are conceivable
within the scope of the invention.



Appendix 
This appendix describes additional details of the method of generating dynamic sprites
for video encoding as set out above.

This technique requires three main stages: segmentation of video objects which undergo a 
coherent rigid motion, parametric global motion estimation, and progressive building of a 
dynamic sprite for each of the video objects. 
1. Rigid Motion Estimation

Rigid motion estimation is performed on each of the video objects by means of a
generalized hierarchical parametric matching technique. Basically, the motion parameters
$\vec{a} = (a_1, \ldots, a_9)$ are obtained by minimizing a disparity measure between the video object
$R$ at time $t$ and the mapped region at time $t-1$:

$$\min_{\vec{a}} \sum_{\vec{r} \in R} \left\| I(\vec{r}, t) - I(T(\vec{r}, \vec{a}), t-1) \right\|,$$

where $I(\vec{r}, t)$ denotes the image intensity at location $\vec{r}$ and time $t$, $T(\vec{r}, \vec{a})$ is the location
to be matched in the previous frame, and $\|\cdot\|$ denotes the distance measure.
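
As a minimal illustration of the disparity minimization, the sketch below restricts the motion model to an integer translation (the first stage of the progressive search described in the main text) and uses the absolute difference as the distance measure. Both restrictions are assumptions made to keep the example short; the exhaustive search, the function names, and the synthetic images are hypothetical.

```python
def disparity(current, previous, region, shift):
    """Sum of absolute differences over `region` for an integer translation.

    A pixel (x, y) of the current frame is matched to (x + dx, y + dy)
    in the previous frame.
    """
    dx, dy = shift
    total = 0
    for x, y in region:
        total += abs(current[y][x] - previous[y + dy][x + dx])
    return total

def best_translation(current, previous, region, search_range=1):
    """Exhaustively search integer shifts and return the one with minimum disparity."""
    candidates = [
        (dx, dy)
        for dy in range(-search_range, search_range + 1)
        for dx in range(-search_range, search_range + 1)
    ]
    return min(candidates, key=lambda s: disparity(current, previous, region, s))

# Synthetic previous frame with a non-repeating intensity pattern.
size = 6
previous = [[x * x + 3 * y * y for x in range(size)] for y in range(size)]
# Current frame: each interior pixel equals the previous-frame pixel displaced by (1, -1).
current = [
    [previous[max(y - 1, 0)][min(x + 1, size - 1)] for x in range(size)]
    for y in range(size)
]
# Region of the video object, kept away from the border so all lookups stay in range.
region = [(x, y) for y in (2, 3) for x in (2, 3)]
shift = best_translation(current, previous, region)
```

The recovered shift would seed a1 and a4; the affine and perspective parameters would then be refined around that starting point.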

2. Warping 

The motion parameters between time $t-1$ and time $t$ are denoted by
$\vec{a}(t-1, t) = (a_1, \ldots, a_9)$. Those parameters define the relation between the pixel
coordinates $\vec{r} = (x, y)^T$ at time $t$ and $\vec{r}\,' = (x', y')^T$ at time $t-1$ (where the time axis is
normalized such that the sampling interval $\Delta t = 1$).

A forward transform $T_f[\,]$ which gives the pixel $\vec{r} = (x, y)^T$ corresponding to the pixel
$\vec{r}\,' = (x', y')^T$ at time $t-1$ can be defined as

$$\begin{pmatrix} x \\ y \end{pmatrix} =
\begin{pmatrix}
\dfrac{a_1 + a_2 x' + a_3 y'}{a_7 x' + a_8 y' + a_9} \\[2ex]
\dfrac{a_4 + a_5 x' + a_6 y'}{a_7 x' + a_8 y' + a_9}
\end{pmatrix}$$

or

$$\vec{r} = T_f[\vec{r}\,', \vec{a}(t-1, t)].$$

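Similarly, a backward transform $T_b[\,]$ which gives the pixel $\vec{r}\,' = (x', y')^T$ corresponding
to the pixel $\vec{r} = (x, y)^T$ at time $t$ can be defined as

$$\begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix}
\dfrac{a_3 a_4 - a_1 a_6 + (a_6 a_9 - a_4 a_8)\,x + (a_1 a_8 - a_3 a_9)\,y}{a_2 a_6 - a_3 a_5 + (a_5 a_8 - a_6 a_7)\,x + (a_3 a_7 - a_2 a_8)\,y} \\[2ex]
\dfrac{a_1 a_5 - a_2 a_4 + (a_4 a_7 - a_5 a_9)\,x + (a_2 a_9 - a_1 a_7)\,y}{a_2 a_6 - a_3 a_5 + (a_5 a_8 - a_6 a_7)\,x + (a_3 a_7 - a_2 a_8)\,y}
\end{pmatrix}$$

or

$$\vec{r}\,' = T_b[\vec{r}, \vec{a}(t-1, t)].$$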
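
The two transforms can be checked numerically: warping a point forward and then backward should return the original point. The sketch below implements the nine parameter perspective forms of both transforms, with the backward form obtained by inverting the forward mapping. The function names and the sample parameter values are hypothetical.

```python
def forward_transform(point, params):
    """T_f: map (x', y') at time t-1 to (x, y) at time t."""
    a1, a2, a3, a4, a5, a6, a7, a8, a9 = params
    xp, yp = point
    denom = a7 * xp + a8 * yp + a9
    return ((a1 + a2 * xp + a3 * yp) / denom, (a4 + a5 * xp + a6 * yp) / denom)

def backward_transform(point, params):
    """T_b: map (x, y) at time t back to (x', y') at time t-1."""
    a1, a2, a3, a4, a5, a6, a7, a8, a9 = params
    x, y = point
    denom = a2 * a6 - a3 * a5 + (a5 * a8 - a6 * a7) * x + (a3 * a7 - a2 * a8) * y
    xp = (a3 * a4 - a1 * a6 + (a6 * a9 - a4 * a8) * x + (a1 * a8 - a3 * a9) * y) / denom
    yp = (a1 * a5 - a2 * a4 + (a4 * a7 - a5 * a9) * x + (a2 * a9 - a1 * a7) * y) / denom
    return xp, yp

# Hypothetical parameters: a small pan, slight zoom and mild perspective distortion.
params = (3.0, 1.1, 0.05, -2.0, -0.04, 0.95, 0.0005, -0.0003, 1.0)
original = (12.0, 7.0)
warped = forward_transform(original, params)
recovered = backward_transform(warped, params)
```

Because the parameters are defined only up to a common scale factor (eight of the nine are independent), multiplying all nine by the same non-zero constant leaves both transforms unchanged.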



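It can also be noted that the affine model is a particular case of the perspective model
where $a_7 = a_8 = 0$ and $a_9 = 1$. For instance, in the affine case the forward transform
becomes

$$\begin{pmatrix} x \\ y \end{pmatrix} =
\begin{pmatrix} a_1 + a_2 x' + a_3 y' \\ a_4 + a_5 x' + a_6 y' \end{pmatrix}$$

and the backward transform becomes

$$\begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix}
\dfrac{a_3 a_4 - a_1 a_6 + a_6 x - a_3 y}{a_2 a_6 - a_3 a_5} \\[2ex]
\dfrac{a_1 a_5 - a_2 a_4 - a_5 x + a_2 y}{a_2 a_6 - a_3 a_5}
\end{pmatrix}$$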

3. Dynamic Sprites 
3.1 Fixed Reference 

First consider the case of a fixed reference corresponding to the first frame at time $t_0 = 0$.
Here, the dynamic sprite is in a fixed coordinate system, and the current frame at time $t$
is warped to this coordinate system to be blended with the sprite at time $t-1$ in order to
generate the new sprite at time $t$.

More formally, the dynamic sprite is updated as follows. If the sprite exists already at the
spatial location $\vec{R}$, then the update is given by






EP0 849 950 A2 

$$M(\vec{R}, t) = (1 - \alpha \cdot S(\vec{r})) \cdot M(\vec{R}, t-1) + \alpha \cdot S(\vec{r}) \cdot W_b[I(\vec{r}, t), \vec{a}(t_0, t)],$$

otherwise the sprite at this location $\vec{R}$ is initialized with the current video object (also
used to initialize the sprite at time $t_0 = 0$)

$$M(\vec{R}, t) = S(\vec{r}) \cdot W_b[I(\vec{r}, t), \vec{a}(t_0, t)],$$

where $I(\vec{r}, t)$ denotes the image intensity at location $\vec{r}$ and time $t$, $W_b[I, \vec{a}]$ represents
the backward warping of image $I$ using motion parameters $\vec{a}$, $M(\vec{R}, t)$ denotes the
dynamic sprite at location $\vec{R}$ and time $t$, $\alpha$ is a weighting factor such that $\alpha \in [0, 1]$, $S(\vec{r})$
is the binary segmentation mask for the video object, i.e. $S(\vec{r}) = 1$ if $\vec{r}$ belongs to the
object and 0 otherwise, and $\vec{a}(t_0, t)$ is the global motion over the time period from time $t_0$
to time $t$.

The parameter $\alpha$ controls the update of the sprite and can be dynamically adapted.
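
The update rule can be sketched in a few lines, assuming the current frame and its segmentation mask have already been backward warped into sprite coordinates. Representing "the sprite does not exist yet at this location" by `None` is an assumption of this sketch, as are the function name and the sample values.

```python
def update_sprite(sprite, warped_frame, warped_mask, alpha):
    """Blend a frame (already warped to sprite coordinates) into the sprite.

    Entries of `sprite` are None where nothing has been stored yet.
    `warped_mask` holds 1 inside the video object and 0 outside.
    """
    updated = []
    for sprite_row, frame_row, mask_row in zip(sprite, warped_frame, warped_mask):
        row = []
        for stored, value, inside in zip(sprite_row, frame_row, mask_row):
            if stored is None:
                # Initialization: copy the object pixel, leave the rest empty.
                row.append(value if inside else None)
            else:
                # Weighted blend; outside the mask the stored value is kept.
                weight = alpha * inside
                row.append((1 - weight) * stored + weight * value)
        updated.append(row)
    return updated

# One-row example: one stored pixel, one new object pixel, one pixel outside the object.
sprite = [[100.0, None, None]]
frame = [[120.0, 80.0, 50.0]]
mask = [[1, 1, 0]]
new_sprite = update_sprite(sprite, frame, mask, alpha=0.25)
```

A small weighting factor makes the sprite change slowly and average out noise over many frames; a value close to 1 makes it track the most recent frame.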
3.1.1 Summing motion parameters 

To compute $\vec{a}(t_0, t)$, the pairwise motion parameters $\vec{a}(t_0, t_1), \vec{a}(t_1, t_2), \ldots, \vec{a}(t-1, t)$
have to be added. The pairwise addition is done by the following formula which can be
iterated to sum the parameters from time $t_0$ to $t$.

If $\vec{a}(t_0, t_1) = (a_1, \ldots, a_9)$ are the motion parameters between time $t_0$ and $t_1$, and
$\vec{a}(t_1, t_2) = (b_1, \ldots, b_9)$ are the motion parameters between time $t_1$ and $t_2$, then the motion
parameters $\vec{a}(t_0, t_2) = (c_1, \ldots, c_9)$ between time $t_0$ and $t_2$ are given by

$$\begin{aligned}
c_1 &= b_1 a_9 + b_2 a_1 + b_3 a_4 \\
c_2 &= b_1 a_7 + b_2 a_2 + b_3 a_5 \\
c_3 &= b_1 a_8 + b_2 a_3 + b_3 a_6 \\
c_4 &= b_4 a_9 + b_5 a_1 + b_6 a_4 \\
c_5 &= b_4 a_7 + b_5 a_2 + b_6 a_5 \\
c_6 &= b_4 a_8 + b_5 a_3 + b_6 a_6 \\
c_7 &= b_7 a_2 + b_8 a_5 + b_9 a_7 \\
c_8 &= b_7 a_3 + b_8 a_6 + b_9 a_8 \\
c_9 &= b_7 a_1 + b_8 a_4 + b_9 a_9
\end{aligned}$$
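
The summation formula is the composition of two perspective transforms, which can be verified numerically: warping a point with the first parameter set and then the second should give the same result as warping once with the summed parameters. The sketch below assumes the forward transform x = (a1 + a2 x' + a3 y') / (a7 x' + a8 y' + a9), y = (a4 + a5 x' + a6 y') / (a7 x' + a8 y' + a9); the function names and sample values are hypothetical.

```python
def forward_transform(point, params):
    """Map a point from the earlier time instant to the later one."""
    a1, a2, a3, a4, a5, a6, a7, a8, a9 = params
    xp, yp = point
    denom = a7 * xp + a8 * yp + a9
    return ((a1 + a2 * xp + a3 * yp) / denom, (a4 + a5 * xp + a6 * yp) / denom)

def sum_motion(a, b):
    """Return c such that warping with c equals warping with a, then with b."""
    a1, a2, a3, a4, a5, a6, a7, a8, a9 = a
    b1, b2, b3, b4, b5, b6, b7, b8, b9 = b
    return (
        b1 * a9 + b2 * a1 + b3 * a4,
        b1 * a7 + b2 * a2 + b3 * a5,
        b1 * a8 + b2 * a3 + b3 * a6,
        b4 * a9 + b5 * a1 + b6 * a4,
        b4 * a7 + b5 * a2 + b6 * a5,
        b4 * a8 + b5 * a3 + b6 * a6,
        b7 * a2 + b8 * a5 + b9 * a7,
        b7 * a3 + b8 * a6 + b9 * a8,
        b7 * a1 + b8 * a4 + b9 * a9,
    )

# Hypothetical pairwise estimates for t0 -> t1 and t1 -> t2.
a = (2.0, 1.0, 0.1, -1.0, 0.0, 1.0, 0.001, 0.0, 1.0)
b = (0.5, 0.9, 0.0, 3.0, 0.05, 1.1, 0.0, 0.002, 1.0)
c = sum_motion(a, b)

point = (4.0, 5.0)
two_steps = forward_transform(forward_transform(point, a), b)
one_step = forward_transform(point, c)
```

Iterating `sum_motion` over successive frame pairs accumulates the global motion from the first frame to the current one, as needed for the fixed reference.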
3.1.2 Warping and interpolation

The backward warped image u;[/(r,f). a(t Q ,t)] is computed by bilinear interpolation as 
follow. If the warped pixel is V = T b [r 9 a{t 0 j)\, and A.B,C,D are the pixels at the grid 
location. The pixel r contributes to the pixels A,B t CD of the warped image as follow: 

A+ = (l-it) (l-<v). /(?,/) , 
fl+ = (l-<ir) -dy-Hrj) , 
C+ = dx(l-</y)./(r ( r) , 
D+ = dxdyI{rj) f 

where rf, and rf v correspond to the distance between the warped pixel r' and the grid 
location A. 

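In order to correctly normalize the warped image, the weight of each contribution is stored:

$$
\begin{aligned}
\mathrm{norm}A &\mathrel{+}= (1-d_x)(1-d_y), \\
\mathrm{norm}B &\mathrel{+}= (1-d_x) \, d_y, \\
\mathrm{norm}C &\mathrel{+}= d_x (1-d_y), \\
\mathrm{norm}D &\mathrel{+}= d_x \, d_y.
\end{aligned}
$$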

The operation is iterated for all the pixels $r$ in the image $I(r,t)$. Finally, the pixels of the
warped image (for which a contribution has been made, i.e. norm > 0) are normalized as
follows:

$$
\begin{aligned}
A &= A / \mathrm{norm}A, \\
B &= B / \mathrm{norm}B, \\
C &= C / \mathrm{norm}C, \\
D &= D / \mathrm{norm}D.
\end{aligned}
$$
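
The accumulate-then-normalize procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `transform` is a hypothetical callable standing in for $T_b$, the placement of A, B, C, D (A at the floor of the warped position, B one row below, C one column to the right, D diagonal) is inferred from the weights, and contributions that fall outside the output array are simply dropped.

```python
import numpy as np


def backward_warp(image, transform, out_shape):
    """Splat every pixel of `image` into an output grid with bilinear weights.

    `transform(x, y)` is a hypothetical stand-in for T_b; it returns the
    warped position (x', y') of pixel (x, y) in the output coordinate system.
    """
    acc = np.zeros(out_shape, dtype=float)   # accumulated intensities
    norm = np.zeros(out_shape, dtype=float)  # accumulated weights
    height, width = out_shape
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            xw, yw = transform(x, y)
            x0, y0 = int(np.floor(xw)), int(np.floor(yw))  # grid location A
            dx, dy = xw - x0, yw - y0
            targets = (
                (x0, y0, (1 - dx) * (1 - dy)),   # A
                (x0, y0 + 1, (1 - dx) * dy),     # B
                (x0 + 1, y0, dx * (1 - dy)),     # C
                (x0 + 1, y0 + 1, dx * dy),       # D
            )
            for tx, ty, w in targets:
                if 0 <= tx < width and 0 <= ty < height and w > 0:
                    acc[ty, tx] += w * image[y, x]
                    norm[ty, tx] += w
    # Normalize only where at least one contribution was made (norm > 0).
    warped = np.zeros(out_shape, dtype=float)
    covered = norm > 0
    warped[covered] = acc[covered] / norm[covered]
    return warped, covered
```

The returned `covered` array marks the output pixels that received a contribution, so a caller can tell warped content apart from empty locations.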

3.2 Time-varying Reference 

The case of a time-varying reference corresponding to the current frame is similar. Now
the sprite at time $t-1$ is warped to the coordinate system of the current frame, and the
current frame at time $t$ is then blended with the warped sprite in order to generate the new
sprite at time $t$.

More precisely, the dynamic sprite is updated as follows. If the sprite exists already at the
spatial location $R$, then the update is given by

$$M(R,t) = (1 - \alpha \cdot S(r)) \cdot W_f[M(R,t-1), \vec{a}(t-1,t)] + \alpha \cdot S(r) \cdot I(r,t),$$

otherwise the sprite at this location $R$ is initialized with the current video object (also
used to initialize the sprite at time $t_0 = 0$)

$$M(R,t) = S(r) \cdot I(r,t),$$

where $W_f[M, \vec{a}]$ represents the forward warping of sprite $M$ using motion parameters
$\vec{a}$.

The parameter $\alpha$ controls the update of the sprite and can be dynamically adapted.
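
The update rule for the time-varying reference can be sketched as a per-pixel blend. The Python fragment below is an illustration only: the function name, the argument names, and the boolean array `sprite_exists` (marking locations where the warped sprite already holds data) are assumptions introduced for the example.

```python
import numpy as np


def update_sprite(warped_sprite, sprite_exists, frame, mask, alpha):
    """Blend the current frame into the forward-warped sprite.

    warped_sprite : W_f[M(R, t-1), a(t-1, t)], already in frame coordinates
    sprite_exists : boolean array, True where the warped sprite has content
    frame         : I(r, t), the current frame
    mask          : S(r), 1 inside the video object and 0 elsewhere
    alpha         : weighting factor in [0, 1]
    """
    s = mask.astype(float)
    # Update where the sprite already exists.
    blended = (1.0 - alpha * s) * warped_sprite + alpha * s * frame
    # Initialization with the current video object elsewhere.
    initialized = s * frame
    return np.where(sprite_exists, blended, initialized)
```

With `alpha = 0` an existing sprite location is left unchanged, and with `alpha = 1` it is overwritten by the current frame wherever the mask is set.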
3.2.1 Warping and interpolation 

The forward warped sprite $W_f[M(R,t-1), \vec{a}(t-1,t)]$ is computed by 0th-order
interpolation as follows. If the warped pixel is $R' = T_f[R, \vec{a}(t-1,t)]$, the pixel $R$
contributes only to the pixel closest to $R'$, e.g.:

$$
\begin{aligned}
A &\mathrel{+}= M(R,t-1), \\
B &\mathrel{+}= 0, \\
C &\mathrel{+}= 0, \\
D &\mathrel{+}= 0.
\end{aligned}
$$

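In order to correctly normalize the warped image, the weight of each contribution is stored:

$$
\begin{aligned}
\mathrm{norm}A &\mathrel{+}= 1, \\
\mathrm{norm}B &\mathrel{+}= 0, \\
\mathrm{norm}C &\mathrel{+}= 0, \\
\mathrm{norm}D &\mathrel{+}= 0.
\end{aligned}
$$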

The operation is iterated for all the pixels $R$ in the sprite $M(R,t)$.

Finally, the pixels of the warped sprite (for which a contribution has been made, i.e.
norm > 0) are normalized as follows:

$$
\begin{aligned}
A &= A / \mathrm{norm}A, \\
B &= B / \mathrm{norm}B, \\
C &= C / \mathrm{norm}C, \\
D &= D / \mathrm{norm}D.
\end{aligned}
$$
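
A minimal Python sketch of the 0th-order (nearest-pixel) forward warp follows. It is an illustration under assumptions: `transform` is a hypothetical callable standing in for $T_f$, `valid` is a hypothetical boolean array marking which sprite locations hold data, and rounding to the closest grid position is done with floor(x + 0.5).

```python
import numpy as np


def forward_warp_nearest(sprite, valid, transform, out_shape):
    """Move each valid sprite pixel to the grid position closest to its warp.

    `transform(x, y)` is a hypothetical stand-in for T_f; it returns the
    warped position (x', y') of sprite pixel (x, y) in frame coordinates.
    """
    acc = np.zeros(out_shape, dtype=float)   # accumulated intensities
    norm = np.zeros(out_shape, dtype=float)  # number of contributions
    height, width = out_shape
    for y in range(sprite.shape[0]):
        for x in range(sprite.shape[1]):
            if not valid[y, x]:
                continue
            xw, yw = transform(x, y)
            tx = int(np.floor(xw + 0.5))  # closest grid column
            ty = int(np.floor(yw + 0.5))  # closest grid row
            if 0 <= tx < width and 0 <= ty < height:
                acc[ty, tx] += sprite[y, x]
                norm[ty, tx] += 1.0
    # Average the contributions where at least one was made (norm > 0).
    warped = np.zeros(out_shape, dtype=float)
    covered = norm > 0
    warped[covered] = acc[covered] / norm[covered]
    return warped, covered
```

When several sprite pixels land on the same output pixel, the normalization step averages them, which is why the contribution count is stored alongside the intensities.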



4. Video Coding using Dynamic Sprites



4.1 Sprite Prediction 

The sprite prediction is obtained by backward prediction. 

The prediction $\hat{I}(r,t)$ of the image $I(r,t)$ is given in the case of a fixed reference sprite
by

$$\hat{I}(r,t) = W_b[M(R,t), \vec{a}(t_0,t)],$$

and in the case of a time-varying reference by

$$\hat{I}(r,t) = W_b[M(R,t), \vec{a}(t-1,t)].$$

At this stage a bilinear interpolation is always used.



Claims 

1. A computer implemented method for processing video data representing a scene, comprising the steps of:

segmenting the video data into rigidly and non-rigidly moving video objects;
estimating motion information for the rigidly moving objects; and 

integrating the rigidly moving objects, using the motion information, into corresponding sprites stored in a 
memory. 

2. The method of claim 1 further comprising: 

estimating local motion vectors for the video data; 

clustering the local motion vectors to determine dominant motions; and 

segmenting the video data having local motion vectors substantially similar to the dominant motions as the
rigidly moving video objects.

3. The method of claim 2 further comprising: 

determining first residual information present in a prediction error of the dominant motions; 
determining second residual information present in a prediction error of the local motion vectors; 
segmenting the video data as the rigidly moving video objects if the first residual information is substantially
similar to the second residual information, otherwise,
segmenting the video data as non-rigidly moving video objects. 

4. The method of claim 1 further comprising: 

estimating the dominant motions using a parametric motion model, wherein the parametric motion model is a
perspective motion model.

5. The method of claim 2 wherein there is one video object and a corresponding sprite for each dominant motion. 
6. The method of claim 1 wherein the video data are organized as a temporal sequence of frames, and further
comprising:

aligning the frames with respect to a coordinate system further comprising:

aligning the frames with respect to a first frame of the temporal sequence of frames to establish a fixed
reference coordinate system for the sprites, wherein further, each frame and each sprite includes a plurality
of pixels, each pixel representing a light intensity value of the scene, and further comprising:

warping pixels of a current frame of the temporal sequence of frames to the fixed reference coordinate
system of the sprites using a backward transform prior to integrating.

7. The method of claim 6 further comprising: 

aligning the sprites with respect to a current frame of the temporal sequence of frames to establish a
time-varying coordinate system for the sprites, wherein each frame and each sprite includes a plurality of pixels, each
pixel representing a light intensity value of the scene, and further comprising:

warping pixels of the sprites to a current frame of the temporal sequence of frames using a forward
transform prior to integrating.

8. The method of claim 1 further comprising: 

encoding the video data, and motion and shape information as encoded data using an encoder of a first
processing system;

decoding the encoded data using a first decoder to build a first sprite in a first memory of the first processing
system;

transmitting the encoded data to a second processing system;

decoding the encoded data using a second decoder to build a second sprite in a second memory of the second
processing system, the first and second sprites being substantially identical, wherein the video data are
organized as a temporal sequence of frames, and each frame includes a plurality of blocks of pixels, and further
comprising:

selecting a particular block of pixels of a current frame of the temporal sequence of frames for encoding
from the first sprite or from a previous frame of the temporal sequence of frames, further comprising:

encoding the particular block using the motion information if the particular block is selected for
encoding from the first sprite; and

encoding the particular block using the local motion vectors if the particular block is selected for
encoding from the previous frame.

9. The method of claim 8 further comprising: 

determining a first prediction error for the particular block based on encoding the particular block from the first
sprite;

determining a second prediction error for the particular block based on encoding the particular block from the
previous frame; and

selecting the particular block for encoding from the first sprite if the first prediction error is less than the second
prediction error, otherwise selecting the particular block for encoding from the previous frame.



10. The method of claim 6 further comprising:

selecting one of the sprites having the fixed reference coordinate system to construct a panoramic view of a
background portion of the scene represented by the video data, the panoramic view only including rigidly
moving video objects.

11. An apparatus for processing video data representing a scene, comprising:

a camera acquiring video data representing light intensity values of a scene;

means for segmenting the video data into rigidly and non-rigidly moving video objects;

means for estimating motion information for the rigidly moving objects; and

a memory storing an integration of the rigidly moving objects as dynamic sprites.